Preemptive thanks are due to both Alex Papiu and Joshua Ellis, who have also made some excellent assessments of this data and from whom some of the code in this script is directly taken. Thank you both.
I will give a brief summary of the data cleaning process:
- Use primary_results.csv and county_facts.csv to A) create a data frame votes using dplyr that gives the Democratic winner by county and B) use some demographic features.
- Create a variable affiliation to identify the red and blue states.
- Build the votes dataframe by county and by state.

# Import packages
# magrittr is needed for the %<>% assignment pipe used further down
libs <- c("dplyr","DT","GGally","ggplot2","gridExtra","magrittr","plotly","stargazer")
lapply(libs, require, character.only = TRUE)
# Import data
demographics <- read.csv("../input/county_facts.csv",stringsAsFactors = F)
demographics <- subset(demographics,state_abbreviation !="")
primary <- read.csv("../input/primary_results.csv",stringsAsFactors = F)
# Current red/blue states by Gallup
# http://www.gallup.com/poll/188969/red-states-outnumber-blue-first-time-gallup-tracking.aspx
solidred = c("ID","MT","WY","UT","ND","SD","KS","OK","TN","AL","SC","AK")
solidblue = c("MA","RI","CT","NJ","DE","MD","VT","NY","IL","NM","CA")
leanred = c("NH","WV","IN","MS","AR","MO","TX","NE")
leanblue = c("WA","OR")
competitive = c("ME","PA","VA","NC","GA","FL","OH","LA","MI","WI","IA","MN","CO","AZ","NV")
# Label each row with its state's affiliation; states in none of the lists get 0
primary$affiliation = ifelse(primary$state_abbreviation %in% solidred, "Solid Red",
  ifelse(primary$state_abbreviation %in% solidblue, "Solid Blue",
    ifelse(primary$state_abbreviation %in% leanred, "Lean Red",
      ifelse(primary$state_abbreviation %in% leanblue, "Lean Blue",
        ifelse(primary$state_abbreviation %in% competitive, "Swing", 0)))))
# Get the Democratic winner in each county and the fraction of votes they won
votes.d <- primary %>%
  filter(party == "Democrat") %>%
  group_by(state_abbreviation, county, affiliation) %>%
  summarize(winner = candidate[which.max(fraction_votes)],
            Vote = max(fraction_votes),
            votes = max(votes))
demographics %<>%
  # filter(state_abbreviation %in% c("IA", "NV", "SC")) %>%
  select(state_abbreviation = state_abbreviation, county = area_name,
         income = INC110213, college = EDU685213, density = POP060210,
         hispanic = RHI725214, black = RHI225214, white = RHI825214,
         asian = RHI425214, natives = RHI525214,
         belowpov = PVY020213) %>%
  mutate(county = gsub(" County", "", county))
# make sure to join by state too since some county names overlap
votes.d <- inner_join(votes.d, demographics, by = c("state_abbreviation","county"))
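Because inner_join silently drops rows without a match, it can be worth checking which counties fail to line up on the two keys. The sketch below uses base R on made-up toy rows (the names votes_toy, demo_toy and key, and the college values, are hypothetical, not from the data). The "Parish" suffix is only an illustration of the kind of mismatch that stripping " County" would not catch; whether it actually occurs in these files is an assumption to verify.

```r
# Toy stand-ins for votes.d and demographics (invented rows, not real data)
votes_toy <- data.frame(
  state_abbreviation = c("IA", "IA", "LA", "VA"),
  county = c("Adair", "Polk", "Acadia", "Fairfax"),
  stringsAsFactors = FALSE
)
demo_toy <- data.frame(
  state_abbreviation = c("IA", "IA", "LA", "VA"),
  county = c("Adair", "Polk", "Acadia Parish", "Fairfax"),
  college = c(15.2, 34.9, 11.0, 58.6),  # invented values
  stringsAsFactors = FALSE
)

# Build a combined state|county key, mirroring the two join columns
key <- function(df) paste(df$state_abbreviation, df$county, sep = "|")

# Rows of the votes table that would be dropped by an inner join
unmatched <- votes_toy[!(key(votes_toy) %in% key(demo_toy)), ]
print(unmatched)
```

With dplyr loaded, anti_join(votes.d, demographics, by = c("state_abbreviation", "county")) should give the same kind of listing on the real tables.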
The first type of observation we will consider is the distribution of vote share by college and race.
There’s a skewness here. Most of the counties considered have less than 50% of their population holding a bachelor’s degree. Among these counties, most of the vote share goes to Clinton. If you look at the right side of this chart (as well as the two charts following), you will see the same distribution split into four different state affiliations. Not only does Clinton get most of her vote share from counties with a relatively small share of college graduates, but she also gets most of her votes from states that are Red (Republican) and in the middle (Swing states). However, the gap in vote share between Clinton and Sanders narrows as you move up to counties with a larger share of college graduates.
Clinton tends to win the white counties over Sanders, regardless of how much of the population is white. The less white counties tend to be even more likely to side with Clinton, which may be strange to think about when you consider what this may mean for the voting decisions of the other racial groups in comparison to the popularity of Bernie Sanders.
Since I’m taking into account the other ethnicities (minus natives and mixed), I think we can expect a duality. That is, if most counties have a majority-white population, then those same counties are bound to have a lower share of the population for other races. Regardless of this notion, the result is pretty much the same. Clinton is just winning (particularly among the Red and Swing states). That said, it means there is fairly strong support for Sanders in the Blue (Democrat) states.
Before I run a linear model, I want to look at trends against fraction_votes for both candidates.
These plots appear to support my analysis of college and vote share between Clinton and Sanders: the more college attainment a county has, the more likely it is to vote Sanders in the 2016 Presidential Election season, whereas the opposite relationship holds for Clinton. I know I didn’t include any age variables, but the college variable can be a little indicative of how young the supporting population may be (I would have to look more into that). I say that because Sanders is very popular among young voters (Blake, 2016).
This case was kind of confusing, but I think I got it sorted out: I had to remind myself, and you, that this is about vote share. Recall Figure 2. What was notable about that observation was that for counties with a low share of white population, the vote share appears to favor Clinton more than Sanders, and yet once we move up the ladder to counties with a larger share of Caucasians, we seem to get more support for Sanders. I’m not sure how convincing this is, but I see this trend accompanied by high college attainment, as mentioned above.
The vote share for Clinton among the black/Hispanic/Asian racial groupings appears to be consistent among Red and Swing states, regardless of proportion. The relationship shows a positive trend for Clinton and a negative one for Sanders, suggesting that perhaps I shouldn’t expect positive support for Sanders from these 3 racial groups.
In assessing the relationship between vote share and socioeconomic cleavages, I put together this model:
\[ Y_i = \alpha + X_i\beta + \epsilon_i \]
where…
It’s nothing particularly special: I suppose this is a general model anyone with intermediate statistics experience would have tried to use, and reasonably so. Given the data collected by the Kaggle Crew, I don’t think there is anything else available that can actually explain vote share, notwithstanding the obvious omission of Native Americans, Pacific Islanders, mixed races and age groups.
By now I think I know what to expect. Given that we’re dealing with two kinds of outcomes (vote share for Sanders and for Clinton), I expect duality to creep its way into the model output. The question I want to look into is the difference in model output between Clinton and Sanders by state affiliation. More specifically, how well do Clinton and Sanders do in Red states, Blue states, and Swing states?
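As a rough sketch of how one regression per affiliation group could be set up (the actual model code is not shown here, so this is an assumption about its shape), the block below fits the same linear model separately to each group. The data are simulated, the column name clinton and the three-way grouping are hypothetical, and only three of the predictors are included to keep the example short.

```r
set.seed(42)
n <- 300

# Simulated county-level data (NOT the real primary results)
toy <- data.frame(
  affiliation = sample(c("Red", "Blue", "Swing"), n, replace = TRUE),
  college = runif(n, 5, 60),   # % with a bachelor's degree
  black   = runif(n, 0, 50),   # % black population
  income  = runif(n, 20, 90)   # median income, in thousands
)

# Hypothetical outcome: Clinton's vote share, generated from known slopes
toy$clinton <- 45 - 0.4 * toy$college + 0.8 * toy$black +
  0.1 * toy$income + rnorm(n, sd = 8)

# Split the data by affiliation and fit the same model within each group
fits <- lapply(split(toy, toy$affiliation), function(d) {
  lm(clinton ~ college + black + income, data = d)
})

# One column of coefficients per affiliation group
print(round(sapply(fits, coef), 3))
```

Repeating the same call with a Sanders outcome column would give the second model in each pair, and the fitted objects can be passed to stargazer to produce a side-by-side table.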
Dependent variable: fraction_votes

| | Clinton (Red) | Sanders (Red) | Clinton (Blue) | Sanders (Blue) | Clinton (Swing) | Sanders (Swing) |
|---|---|---|---|---|---|---|
| | (1) | (2) | (3) | (4) | (5) | (6) |
| college | -0.446*** | 0.557*** | -0.315*** | 0.406*** | -0.428*** | 0.437*** |
| | (0.068) | (0.063) | (0.077) | (0.080) | (0.050) | (0.048) |
| white | 0.229*** | -0.198*** | -0.051 | 0.022 | 0.131** | -0.100 |
| | (0.050) | (0.046) | (0.098) | (0.103) | (0.066) | (0.062) |
| asian | 0.320 | -0.282 | 0.321** | -0.346** | 0.370 | -0.209 |
| | (0.396) | (0.365) | (0.160) | (0.167) | (0.256) | (0.242) |
| hispanic | 0.420*** | -0.419*** | 0.195** | -0.195* | 0.161** | -0.178*** |
| | (0.049) | (0.045) | (0.097) | (0.102) | (0.072) | (0.068) |
| belowpov | 0.062 | -0.051 | -0.201 | 0.147 | -0.031 | -0.019 |
| | (0.113) | (0.104) | (0.188) | (0.197) | (0.099) | (0.093) |
| black | 0.997*** | -0.903*** | 0.669*** | -0.748*** | 0.864*** | -0.827*** |
| | (0.048) | (0.044) | (0.103) | (0.108) | (0.064) | (0.061) |
| income | 0.145* | -0.043 | 0.191** | -0.242*** | 0.222*** | -0.167*** |
| | (0.076) | (0.070) | (0.079) | (0.082) | (0.057) | (0.054) |
| Constant | 24.377*** | 61.803*** | 41.965*** | 60.813*** | 31.679*** | 60.918*** |
| | (7.528) | (6.932) | (12.467) | (13.031) | (7.905) | (7.476) |
| N | 1,269 | 1,269 | 397 | 397 | 936 | 936 |
| R2 | 0.508 | 0.541 | 0.425 | 0.418 | 0.612 | 0.649 |
| Adjusted R2 | 0.505 | 0.538 | 0.415 | 0.408 | 0.609 | 0.646 |
| Residual Std. Error | 12.636 (df = 1261) | 11.635 (df = 1261) | 9.238 (df = 389) | 9.656 (df = 389) | 8.876 (df = 928) | 8.395 (df = 928) |
| F Statistic | 185.932*** (df = 7; 1261) | 212.110*** (df = 7; 1261) | 41.062*** (df = 7; 389) | 39.954*** (df = 7; 389) | 208.714*** (df = 7; 928) | 245.104*** (df = 7; 928) |

Notes: *** significant at the 1 percent level; ** significant at the 5 percent level; * significant at the 10 percent level. Standard errors in parentheses.
Based on my results, I find that, yet again, Clinton is winning virtually across the board versus Sanders. At this point it is probably more interesting to note that Sanders receives support from college graduates in all three groups of states. Another thing to note is the belowpov variable, which is not statistically significant in any of the six models; its coefficients are largest in magnitude in the Blue states, which is interesting given how Democrats tend to lean towards more socialist solutions to problems. I think Hillary’s massive support stems from the Clinton brand, particularly among blacks, Hispanics and Asians. It makes sense, given that a lot of the younger voters prefer Sanders.
It is important to note that the model I used was VERY general and likely misspecified. Given the age variables available in the data, I believe there is more to explore. I excluded age from the model because there is no given variable indicative of young voters. I could probably have engineered a variable or two to get to this conclusion, but I relied upon the strength of the college variable to loosely imply that Bernie’s support is tied to younger voters, as indicated by the data I used.
Furthermore, the adjusted R2 value ranges from roughly 0.41 to 0.65 across the models. More specifically, the model doesn’t seem to fit the vote share outcome for the Blue states (adjusted R2 of about 0.41) as well as it does for the Red and Swing states. This may be due to the greater number of observations from Red and Swing states in the dataset (1,269 and 936 counties) than from Blue states (397 counties). Even if I had all of the data from all the states, there would still be more Red than Blue states (perhaps not the same for Swing states), so the model outputs would probably look different. What this difference in observation count may mean for how I should approach the R2 is not very clear. A model fit of even 0.15 can be argued as plausible, depending on the story, so roughly 0.42 is a decent fit, but I don’t want to stay on this topic longer than I need to, given how overrated R2s tend to be.
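For reference, the adjusted R2 penalizes the plain R2 for the number of predictors relative to the sample size, which is why the gap between the two is wider for the smaller Blue-state sample. The small helper below (adj_r2 is a name introduced here for illustration) applies the standard formula to the R2, N and predictor count reported in the table for the two Clinton models.

```r
# Adjusted R2 from plain R2, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Clinton (Red): R2 = 0.508, N = 1,269, 7 predictors
red_adj <- round(adj_r2(0.508, 1269, 7), 3)

# Clinton (Blue): R2 = 0.425, N = 397, 7 predictors
blue_adj <- round(adj_r2(0.425, 397, 7), 3)

print(c(red = red_adj, blue = blue_adj))  # 0.505 and 0.415, matching the table
```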
As a last observation, I will look into the diagnostics of the models. Because I’m dealing with county-level data, I do expect there to be some correlation in the errors, particularly of the spatial kind. Spatial autocorrelation is beyond the scope of this exercise, so I won’t venture too much into that, but it doesn’t change the potential existence of autocorrelation.
The residual plots above show an apparent trend in the residuals, which to me is consistent with autocorrelation. This points to bias in my model, and there’s much more work to be done. Some coefficients are either over- or underestimated relative to their true values; that can cause problems in a more serious study. I focused more on the signs than on the coefficients for that reason. The signs for the most part appear to agree with the previous figures I looked into.
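A residuals-versus-fitted check of this kind can be sketched in base R as follows. The data are simulated with a deliberately curved relationship that the straight-line fit misses, so the smoothed trend in the residuals is visibly not flat. All names here are illustrative, and a trend in this plot signals misspecification in general; it is not by itself a test for spatial autocorrelation.

```r
set.seed(1)

# Simulated data: the true relationship is curved, but a straight line is fitted
x <- runif(300, 0, 60)
y <- 20 + 0.02 * x^2 + rnorm(300, sd = 5)
fit <- lm(y ~ x)

res <- resid(fit)    # residuals of the misspecified model
fv  <- fitted(fit)   # fitted values

# Smooth the residuals against the fitted values to expose any trend
trend <- lowess(fv, res)

plot(fv, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)      # reference line: no trend
lines(trend, col = "red")   # smoothed trend in the residuals
```

For a fitted lm object, plot(fit, which = 1) draws a similar residuals-versus-fitted panel directly.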
More can definitely be said about the determinants of vote share than what the variables I used in this exercise capture. One set of variables I think would be really useful in further exercises is campaign spending data (Webber & Cutts, 2010). Another group of variables worth looking into is employment type, such as the number of people working in the public sector, transit, healthcare, etc. Maybe a look into how many people have white collar and blue collar jobs in each county would be good (for this option, I don’t know if it’s plausible right now).
This exercise was pretty fun. The models that I ran showed me Clinton was very much dominating the vote share through and through against Sanders. That being said, if this was a very serious study, I would have wasted a bit of my time. The duality concept that I put forward in my observations made its way into my regression results. While each model provided the same significant variables for both candidates, I was pretty much better off focusing on just one of the candidates (either one), given their unique situation. This was a 2 man (well, man-woman) race for a while (up until a few months ago), and given this dual relationship, whatever trends you saw for one candidate would be associated with a reversed trend for the other. In this case, where Clinton had negative relationships, Sanders had positive ones and vice versa.
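The duality can be made concrete with a small simulation. If the two vote shares summed to exactly 100 in every county, then regressing each share on the same predictors would give slopes that are exact negatives of each other and intercepts that sum to 100. The data below are simulated and the variable names are only illustrative. In the table above the mirror is close but not exact (for example, the Red-state constants sum to about 86), presumably because the two shares do not add up to exactly 100 in the real data.

```r
set.seed(7)
n <- 200

# Simulated predictors and a simulated Clinton share (not real data)
college <- runif(n, 5, 60)
black   <- runif(n, 0, 50)
clinton <- 50 - 0.4 * college + 0.8 * black + rnorm(n, sd = 8)

# Strict two-candidate race: Sanders gets whatever Clinton does not
sanders <- 100 - clinton

fit_c <- lm(clinton ~ college + black)
fit_s <- lm(sanders ~ college + black)

# Slopes are mirror images; intercepts sum to 100
print(rbind(clinton = coef(fit_c), sanders = coef(fit_s)))
```

This is why, under a near two-way split, modelling one candidate carries almost all of the information about the other.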
For those who would like to know about Primaries and Caucuses (in case you don’t already), here’s a video.
This is crucial because it turns out, as Oliver hinted, that the Democratic Party has two ways for people to vote: Primaries (voting all day long) and Caucuses (public voting at a specific time). Winning primaries and caucuses (and by particular margins) is the way to get a lot of Delegates and Super Delegates (the latter exclusive to the Democratic Party). They are needed to get the Democratic nomination. Clinton has been ahead of Sanders in the same way Obama was ahead of Clinton back in 2008, primarily because of the delegate and super delegate count, which indicates that more supporters of Clinton are able to head on to the Primaries and Caucuses. If it seems like Sanders has been winning states and gaining popularity, well, you’re not kidding yourself. He has been very popular among independent voters (Hanley, 2016; Hopkins, 2016) and has been dominating Caucuses. The problem is twofold: independent voters face barriers in states with closed Primaries (like NY), where not being affiliated with a party can screw you out of a chance to vote, and not every state has a Caucus. I think it’s fair to assume Sanders is the champion of the Caucuses, while Clinton has been the champion of the Primaries.